[Linkpost] “Eliciting secret knowledge from language models” by Arthur Conmy, Bartosz Cywiński, Sam Marks

Update: 2025-10-03
Description

This is a link post.

TL;DR: We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. To that end, we fine-tune LLMs to hold specific knowledge that they apply downstream but deny having when asked directly. We then test a range of black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.

See our X thread and full paper for details.

Training and auditing a model with secret knowledge. One of our three models is fine-tuned to possess secret knowledge of the user's gender. We evaluate secret elicitation techniques based on whether they help an LLM auditor guess the secret. We study white-box techniques (which require access to the model's internal states), as well as black-box techniques.
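
As a rough illustration of the evaluation idea described above (scoring an elicitation technique by whether it helps an auditor guess the secret), here is a minimal Python sketch. It is not the authors' code. The names `AuditTrial`, `success_rate`, and `elicitation_gain` are hypothetical, exact string matching is an assumed scoring rule, and the trial values are made-up placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AuditTrial:
    """One auditing attempt (hypothetical structure, not from the paper)."""
    secret: str         # ground-truth secret the model was fine-tuned to hold
    auditor_guess: str  # what the auditor guessed after inspecting the model


def success_rate(trials: List[AuditTrial]) -> float:
    """Fraction of trials where the guess matches the secret (case-insensitive)."""
    if not trials:
        return 0.0
    hits = sum(
        t.auditor_guess.strip().lower() == t.secret.strip().lower()
        for t in trials
    )
    return hits / len(trials)


def elicitation_gain(baseline: List[AuditTrial],
                     with_technique: List[AuditTrial]) -> float:
    """How much a technique raises auditor success over a no-technique baseline."""
    return success_rate(with_technique) - success_rate(baseline)


# Placeholder trials for illustration only; these are NOT reported results.
baseline_trials = [
    AuditTrial("female", "male"),
    AuditTrial("female", "female"),
    AuditTrial("male", "female"),
    AuditTrial("male", "female"),
]
technique_trials = [
    AuditTrial("female", "female"),
    AuditTrial("female", "female"),
    AuditTrial("male", "male"),
    AuditTrial("male", "female"),
]

gain = elicitation_gain(baseline_trials, technique_trials)
```

Comparing against a baseline is a design choice in this sketch: it separates what the technique contributes from what the auditor could guess from ordinary conversation alone.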

Summary

  • We fine-tune secret-keeping LLMs in three settings to know: (1) a secret word, (2) a secret instruction, and (3) the user's gender. Models are trained to apply this secret [...]

---

Outline:

(01:05) Summary

(02:24) Introduction

---


First published: October 2nd, 2025

Source: https://www.lesswrong.com/posts/Mv3yg7wMXfns3NPaz/eliciting-secret-knowledge-from-language-models-1

Linkpost URL: https://arxiv.org/abs/2510.01070


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Our three secret-keeping models. The Taboo model possesses a secret keyword (

